ITEM SELECTION FOR CRITERION-REFERENCED TESTS



A criterion-referenced test is constructed to provide information on
the performance of an examinee on a set of coherent objectives, usually in
terms of mastery or non-mastery of each objective represented in the test.
The objectives represented in the test will be directly or indirectly related
to some curriculum or segment of curriculum. In mathematics, for example,
the objectives may represent what is generally taught in fourth grade general
math or specifically what is taught in a particular fourth grade math program.
Or the objectives may represent what is to be taught in a six week course in
life saving and water safety. Even if the objectives are not directly
representative of a defined curriculum as, for example, in National Assessment,
there is an implication that the objectives represent some behavior that the
examinee is expected to have learned - usually in a formal school situation. A
criterion-referenced test, then, begins with a set of objectives representing
some curriculum and ends with reporting performance on each of those objectives.
The characteristics of criterion-referenced tests derive from this curriculum
orientation.

A criterion-referenced test is intended to supply information about the
standing of an examinee with respect to a defined or implied curriculum. If
the test represents a reasonably long span of the curriculum, it will yield
many scores - one for each objective covered by the test. There is not much
interest or value in the total test score, since it tells you little about
the specific achievements or deficiencies of the examinee. This is an
obvious and major difference between criterion-referenced tests and
norm-referenced tests. A norm-referenced test provides information about the
standing of the examinee with respect to a reference or norm group and this
can be accomplished with a single aggregate total score. The total score in
itself has little meaning except as a gross measure of amount of achievement
in a given area. Meaning for the score is derived from the norm group, just
as the criterion-referenced scores derive their meaning from the curriculum
represented. A good criterion-referenced test should discriminate well
between mastery and nonmastery of the objectives making up the curriculum
of interest, just as a good norm-referenced test should discriminate well
between examinees who have differing amounts of achievement in the general
area of interest. This has implications for the way in which items are
prepared and selected. Items in a criterion-referenced test should be
sensitive to instruction; items in a norm-referenced test should be sensitive
to individual differences.

A criterion-referenced test is generally intended to be diagnostic and
prescriptive. The test should (1) accurately reflect the examinee's standing
with respect to the curriculum, that is, show his specific strengths and
weaknesses, (2) accurately reflect changes when the examinee's capability to
perform has changed, and (3) lead to appropriate decisions for the further
instruction of the examinee. A norm-referenced test, on the other hand, is
generally intended to be descriptive and predictive. It should (1) accurately
reflect the examinee's standing with respect to the norm group, that is, show
his relative position on the underlying quantity or trait being measured, and
(2) accurately predict what the examinee will be able to do successfully.
These distinctions lead to somewhat different views of reliability and validity
for the two kinds of instruments. The usual validity and reliability coefficients
reported for standardized norm-referenced tests have marginal utility for
describing criterion-referenced tests. A criterion-referenced test should
have demonstrable content validity and it should be sensitive to appropriate
instruction. Reliability in the usual sense has less importance than the
appropriateness of the decisions made that affect the treatment of the examinee.
This goes beyond the instrument itself and leads to considerations of minimizing
risk or cost to the examinee.

Traditionally, for norm-referenced tests, test construction begins
with some sort of comprehensive rationale describing the achievement domain
or underlying trait intended to be measured and describing the kinds of items
that should be written, frequently with examples. After the items are written,
they are tried out on a sample of the target population. Item statistics are
then computed including difficulty levels, point biserial correlations between
each item and the remaining items, and some index of internal consistency,
usually a KR-20. Items are selected that have difficulties around .5, so
they will discriminate well between examinees, and that have high point
biserials, so they will contribute to the homogeneity of the score. An attempt
is also usually made to have the distribution of scores approximate a normal
distribution. Normally distributed scores have valuable psychometric properties:
they correlate well with other similar scores, provide meaningful derived scores,
and so on. For a criterion-referenced test, these statistics are still important,
but of less importance than the ability of the items to indicate mastery or nonmastery
of particular objectives after instruction.
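
The traditional statistics named above can be sketched in a few lines. What follows is a minimal illustration rather than any publisher's actual procedure. It assumes dichotomously scored items (1 = pass, 0 = fail) stored as one list per examinee, and the function names are hypothetical.

```python
from math import sqrt


def item_difficulty(responses, item):
    """Proportion of examinees who passed the given item."""
    return sum(r[item] for r in responses) / len(responses)


def point_biserial(responses, item):
    """Pearson correlation between one 0/1 item and the score on the
    remaining items (the item itself is left out of the total)."""
    x = [r[item] for r in responses]
    y = [sum(r) - r[item] for r in responses]
    n = len(responses)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # no variance: correlation is undefined
    return sxy / sqrt(sxx * syy)


def kr20(responses):
    """Kuder-Richardson formula 20, using the population variance
    of the total scores."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(r) for r in responses]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    pq = 0.0
    for j in range(k):
        p = item_difficulty(responses, j)
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / variance)


# Made-up responses for four examinees on three items.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(item_difficulty(responses, 0), point_biserial(responses, 0), kr20(responses))
```

Under traditional selection, items with difficulty near .5 and a high point biserial would be retained.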

A criterion-referenced test begins with a set of coherent, clearly stated
objectives. Each objective specifically describes the behavior that an examinee
will be able to perform if he has mastered the objective, that is, each objective
specifies a limited domain of behaviors. Items are then written for each
objective that sample as purely as possible the specified domain of behaviors.
This sample of behaviors will, of course, not be random, but hopefully, it
will be representative of the domain. The items will then be tried out on a
sample of the target population. Traditional item statistics will be computed
and attention paid to them. It is more important, however, to determine if the
items are sensitive to instruction. In order to do this, a two-stage item
tryout is required, that is, a pre-instruction administration of the items
followed by a period of time for instruction to occur, then a post-instruction
administration of the items to the same students. It is also necessary to
collect information as accurately as possible about the specific objectives
appearing in the test that were taught between the pre-instruction
administration and the post-instruction administration of the items. If
the instructional program is under the control of the test constructor, this
information is relatively easy to obtain. If not, it can be approximated by
asking the teachers what they have taught.

In order to select items that are sensitive to instruction, it is valuable
to have some procedure for organizing the data and some numerical index reflecting
each item's sensitivity. At CTB/McGraw-Hill, we have adopted a procedure described
by Marks and Noll (1967) developed for a somewhat different purpose. First we
obtain a two-by-two table of frequencies for each item at pre- and post-test like
this:

                          Post-test
                       Failed    Passed
    Pre-test  Failed    $f_1$     $f_2$
              Passed    $f_3$     $f_4$

Here the rows represent, respectively, failed and passed the item at pre-test
and the columns represent failed and passed the item at post-test, so that:

$f_1$ = the frequency of cases that failed the item at both pre- and
        post-test,
$f_2$ = the frequency of cases that failed the item at pre-test, but
        passed it at post-test,
$f_3$ = the frequency of cases that passed the item at pre-test, but
        failed the item at post-test, and
$f_4$ = the frequency of cases that passed the item at both pre- and
        post-test.

$N = f_1 + f_2 + f_3 + f_4$ = the total number of cases that were administered
        the item at both pre- and post-test.
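
As an illustration of how such a table might be tallied, the sketch below assumes that each student's pre- and post-test results on one item are available as paired 0/1 values (1 = passed). The function name and data layout are assumptions made for the example.

```python
def two_by_two(pre, post):
    """Tally (f1, f2, f3, f4) for one item from paired pass/fail results.

    f1: failed at both, f2: failed pre and passed post,
    f3: passed pre and failed post, f4: passed at both.
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must cover the same students")
    f1 = f2 = f3 = f4 = 0
    for a, b in zip(pre, post):
        if not a and not b:
            f1 += 1
        elif not a and b:
            f2 += 1
        elif a and not b:
            f3 += 1
        else:
            f4 += 1
    return f1, f2, f3, f4


# Made-up results for six students on a single item.
pre = [0, 0, 0, 1, 1, 1]
post = [0, 1, 1, 0, 1, 1]
print(two_by_two(pre, post))
```

Only students who took the item on both occasions enter the tally, which matches the definition of N.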
Marks and Noll assume that there is some fixed non-zero probability, $p$,
that a student who does not know the answer to the item will guess the
correct answer. The value of $p$ is determined by the item only and does not
vary from student to student nor from occasion to occasion for the same student,
that is, they admit of no partial knowledge and assume that an examinee's responses
are independent at pre- and post-test when he does not know the correct answer
and fails to learn it. They also assume that the only possible result of
exposure to instruction between pre- and post-test is that a student learns
the correct answer to an item. They admit of no forgetting so that a
non-zero frequency of $f_3$ is solely due to guessing. The "true" value of $f_3$
is zero. With these assumptions, they then reason that $f_1$, those people
who failed the item at both pre- and post-test, is composed only of people
who in fact do not know the answer after instruction. Therefore $f_1$ is equal to
the probability of guessing wrong twice times the number of people in the sample
who do not learn the answer, that is:

$$f_1 = (1-p)^2 \hat{f}_1 \tag{1}$$

where $\hat{f}_1$ is the "true" number of people who do not learn. Similarly $f_2$, those
people who failed the item at pre-test and passed it at post-test, is composed
of the number of people who learned the correct response and guessed wrong at
pre-test plus the number of people who did not learn but guessed right at the
post-test and wrong at the pre-test, so that:

$$f_2 = (1-p)\hat{f}_2 + p(1-p)\hat{f}_1 \tag{2}$$

where $\hat{f}_2$ is the "true" number of people in the sample who did not know
at pre-test, but have learned by the post-test.

Next $f_3$, those people who passed the item at the pre-test but failed it at
the post-test, is again composed solely of those who do not know nor learn the
correct answer but who guessed correctly at the pre-test, that is:

$$f_3 = p(1-p)\hat{f}_1 \tag{3}$$

Finally, $f_4$, those people who passed the item at both pre- and post-test,
is composed of (1) all of the people who in fact know the correct response
at both pre- and post-test, (2) the number of people who learned the answer
and also guessed correctly at the pre-test, and (3) the number of people
who did not know nor learn the answer, but who guessed correctly at both
pre- and post-test, that is:

$$f_4 = \hat{f}_4 + p\hat{f}_2 + p^2\hat{f}_1 \tag{4}$$

where $\hat{f}_4$ is the "true" number of people in the sample who know the correct
answer at both pre- and post-test.
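From equations (1) and (3):

$$p = \frac{f_3}{f_1 + f_3} \tag{5}$$

and equations (1) through (4) form a consistent system so that solutions for
the $\hat{f}$ can be found:

$$\hat{f}_1 = \frac{f_1}{(1-p)^2}, \qquad \hat{f}_2 = \frac{f_2 - f_3}{1-p}, \qquad \hat{f}_3 = 0, \qquad \hat{f}_4 = f_4 - p\hat{f}_2 - p^2\hat{f}_1 \tag{6}$$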
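
The estimation step can be written out as a short computation. This is a sketch under the model's stated assumptions (a guessing probability p fixed for the item, no forgetting, no partial knowledge). The function name, variable names and example frequencies are made up for illustration.

```python
def estimate_true_frequencies(f1, f2, f3, f4):
    """Estimate p and the "true" cell frequencies from observed counts.

    Uses f1 = (1-p)^2 * t1 and f3 = p(1-p) * t1 to obtain p, then solves
    the remaining model equations for t2 and t4. The true value of the
    pass-then-fail cell is zero by assumption.
    """
    if f1 <= 0:
        # With nobody failing twice, p cannot be estimated from the table.
        raise ValueError("f1 must be positive to estimate p")
    p = f3 / (f1 + f3)
    t1 = f1 / (1 - p) ** 2           # never knew, never learned
    t2 = (f2 - f3) / (1 - p)         # learned between pre- and post-test
    t4 = f4 - p * t2 - p * p * t1    # knew the answer on both occasions
    return p, (t1, t2, 0.0, t4)


# Illustrative counts only; they are not taken from any tryout data.
p, (t1, t2, t3, t4) = estimate_true_frequencies(16, 24, 4, 56)
print(p, t1, t2, t3, t4)
```

The estimated frequencies sum to N, which gives a quick check on the arithmetic. With sample data the estimates for the learned and knew-both cells can come out negative; this signals that the model's assumptions do not hold for that item.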



A ratio:

$$D = \frac{\hat{f}_2}{\hat{f}_1 + \hat{f}_2} \tag{7}$$

can serve as an index of the degree to which examinees are selecting the
correct response to the item as a function of the instruction received between
pre- and post-test, that is, a sensitivity index. This index is simply the
proportion of cases that missed the item on the pre-test and then got it
right on the post-test after a correction for guessing has been applied.
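
Under the guessing model described above, the index can also be written directly in terms of the observed frequencies. Taking the guessing probability as f3 / (f1 + f3), the estimated "did not learn" count as f1 / (1 - p)^2 and the estimated "learned" count as (f2 - f3) / (1 - p), the ratio of learned to learned plus not learned simplifies to (f2 - f3) / (f1 + f2). This closed form is an algebraic consequence worked out here, not a formula quoted from Marks and Noll. The sketch below, with hypothetical function names and made-up counts, computes the index both ways so the agreement can be checked.

```python
def sensitivity_index(f1, f2, f3):
    """Closed form: guessing-corrected proportion of pre-test failures
    who passed at post-test."""
    missed_pre = f1 + f2
    if missed_pre == 0:
        raise ValueError("no cases failed the item at pre-test")
    return (f2 - f3) / missed_pre


def sensitivity_index_from_model(f1, f2, f3):
    """Same index computed through the estimated true frequencies."""
    if f1 <= 0:
        raise ValueError("f1 must be positive to estimate p")
    p = f3 / (f1 + f3)
    not_learned = f1 / (1 - p) ** 2
    learned = (f2 - f3) / (1 - p)
    return learned / (not_learned + learned)


# Illustrative counts only.
print(sensitivity_index(16, 24, 4), sensitivity_index_from_model(16, 24, 4))
# More pass-then-fail cases than fail-then-pass cases gives a negative index.
print(sensitivity_index(10, 3, 5))
```

The closed form makes two properties easy to see: the pass-pass cell plays no part in the index, and the index is negative whenever f3 exceeds f2.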

This procedure was applied to data obtained in a two-stage item tryout
for the Prescriptive Reading Inventory (PRI), a criterion-referenced reading
test published by CTB/McGraw-Hill in the fall of 1972. Items were selected
to measure 90 separate reading objectives and these were arranged in four
overlapping levels of the test nominally spanning grades 1.5 through grade 6.
Information about what had been taught to the students in the tryout sample
was obtained from a questionnaire that was filled out by the teachers of these
students at about the time of the post-instruction administration of the
items. The questionnaire listed each objective represented in the test,
written out in full, with spaces by them to mark one of "taught before the
pre-test," "taught between the pre-test and the post-test," and "not yet
taught." In many cases, the teachers marked both the "taught before" and
the "taught between" categories for particular objectives giving rise to
an additional "review" category. The item tryout data were divided into
these four categories.

For each item, then, for each of these four categories and for each grade
group to whom the item was administered (two or three grades), we computed
the two-by-two table of frequencies, the corresponding table of proportions,
the two-by-two table of corrected or estimated "true" frequencies, the
corresponding table of proportions, and the sensitivity index. Since more
than 1,600 items were tried out, this produced an enormous amount of data.

Theoretically, the value of the sensitivity index should be low for the
"taught before the pre-test" group, higher for the "review" group, highest
for the "taught between the pre-test and the post-test" group, and close to
zero for the "not yet taught" group. In our case, we rarely had enough
cases in more than one or two of the groups to get a stable value for the
index. We feel that, in order to get a reasonably reliable value for the
index, there should be at least fifty cases who missed the item at
the pre-test, that is, the sum of $f_1$ and $f_2$ should be fifty or more. The
cases in the $f_4$ cell, those who passed the item at both the pre- and post-test,
do not contribute to the calculation of the index and if the proportion of
cases in the cell is high, which it generally is especially for the "taught
before" and "review" groups, then the index will be of little value. Where
we were able to partially validate the pattern of index values from group to
group, it generally held up, except that the values for the "taught before"
and "not yet taught" groups tended to be higher than expected and the values
for the "review" and "taught between" groups tended to be lower than
expected. This may, in part, be due to the unreliability of the questionnaire
data, upon which the categorization depended.
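
The fifty-case rule of thumb can be applied mechanically when the index is computed for each teaching category. The sketch below is one way this might be done. The dictionary layout, the function name and the counts are assumptions for illustration, and the index is computed in its closed form (f2 - f3) / (f1 + f2), which follows from the guessing model.

```python
def screen_item(tables, min_missed=50):
    """Return the sensitivity index for each teaching category, or None
    where fewer than `min_missed` cases failed the item at pre-test.

    `tables` maps a category label to observed counts (f1, f2, f3, f4).
    """
    result = {}
    for category, (f1, f2, f3, f4) in tables.items():
        missed_pre = f1 + f2
        if missed_pre >= min_missed:
            result[category] = (f2 - f3) / missed_pre
        else:
            result[category] = None  # too few cases for a stable value
    return result


# Made-up counts for one item in two of the four categories.
tables = {
    "taught between": (30, 40, 5, 25),
    "review": (10, 15, 3, 72),
}
print(screen_item(tables))
```

In the made-up "review" row most cases fall in the pass-pass cell, so too few cases remain to support the index.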

Table 1 shows the results for the "taught between pre- and post-test"
group for seven items, all of which were written to measure the objective
"The student will be able to identify compound words." The first thing you
will notice is that as many labels as possible were omitted to save space.
Each 3 by 3 set of numbers is a two-by-two table with marginals organized
as described above. The first one at the top of the page labelled "IF" is
the observed frequencies. The second one down labelled "IP" is the proportions
corresponding to the observed frequencies. The third labelled "IF (EST)" is
the corrected or estimated "true" frequencies. The last labelled "IP (EST)" is
the proportions corresponding to the estimated frequencies. At the very bottom
is the sensitivity index labelled "D". These data are somewhat better than
typical for first graders. It is rare, in our data, to find that all items for
an objective have acceptable values for the sensitivity index. Look now at the
marginal proportions in the second table for those who passed the item at
pre-test and those that passed the item at post-test. These are the item
difficulties at pre- and post-test respectively. For item 11, the first item in the
table, the pre-test item difficulty is .58 and the post-test difficulty is .80.
This is an additional indication of the sensitivity of the item.
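
The pre- and post-test difficulties are the marginal proportions of the two-by-two table. A minimal sketch follows, with a hypothetical function name and made-up counts that are not the Table 1 values.

```python
def marginal_difficulties(f1, f2, f3, f4):
    """Return (pre, post) item difficulties as proportions passing.

    Pre-test passes are the second row (f3 + f4); post-test passes are
    the second column (f2 + f4).
    """
    n = f1 + f2 + f3 + f4
    if n == 0:
        raise ValueError("table is empty")
    return (f3 + f4) / n, (f2 + f4) / n


# Illustrative counts only.
pre, post = marginal_difficulties(16, 24, 4, 56)
print(pre, post)
```

The gain from pre to post equals (f2 - f3) / N, so it depends on the same two off-diagonal cells as the sensitivity index but is taken over all cases rather than only those who failed at pre-test.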

The sensitivity index for reading tends to be higher at the lower grade
levels and higher for discrete skills like recognizing compound words while it
tends to be lower at the upper grades and for comprehension type items. This is,
of course, to be expected, since reading tends to converge to a more or less
unitary skill as practice accumulates.

Table 2 shows rather typical results for seven items all written to
measure the objective "The student will be able to identify the root word in
words with added endings that involve spelling changes." Notice that item 96,
the next to the last item in the table, has a negative sensitivity index. This
occurs whenever $f_3$, the number of cases who passed the item at the pre-test and
failed it at the post-test, is larger than the number of cases who failed
the item at the pre-test and passed it at the post-test. A negative index
indicates that there is a serious problem with the item. In this case, there
is nothing obviously wrong with the item, but looking at the pattern of frequencies
compared to the other items in the set, it seems plausible that the item was miskeyed.

The upper limit of the index is one and it generally should not go below
zero, though it obviously can and does. We had a few objectives the items for
which all had negative index values. In one case, for an objective having to
do with alliteration, the item writer had been unable to write items that got
at the intent of the objective. We subsequently decided that the objective
could not be reasonably measured in a paper and pencil test and excluded it from
the published test. In other cases, the objective was misplaced and the items were
grossly inappropriate for the students who completed them.

After selecting items using the sensitivity index as the primary criterion
for selection, I ran several traditional item analyses lumping all the items
from a tryout booklet together to see what items would have been selected in
the traditional way. One set of items was related to vocabulary objectives and
two others were all comprehension type items. In each case less than half of
the items selected for the criterion-referenced test were selected for a
hypothetical norm-referenced test. For the vocabulary test, 23 items were selected
and of those, 10 were also used in the PRI while 13 were not. For one of the
comprehension tests, 37 items were selected, 16 of which were included in the PRI.
For the other, 42 items were selected, 15 of which were included in the PRI.
Further, the objectives were unevenly represented in the hypothetical norm-referenced
tests. Some objectives were not represented at all while others had as many
as 8 or 10 items selected. Using sensitivity to instruction as the major criterion
for item selection leads to choosing a different set of items than would ordinarily
be chosen.

We also had scores for the California Achievement Tests, 1970 Edition,
Reading Vocabulary subtest for our tryout sample. Using the set of 10 vocabulary
related objectives, I obtained the intercorrelations of these with the CAT Reading
Vocabulary scores and then did a stepwise regression analysis to see how well the
CAT could be predicted from the objective scores. Table 3 shows the intercorrelation
matrix. Note that the highest correlation of any objective with the vocabulary
test is .48. Generally they run about .40. The intercorrelations of the objectives with
each other average around .50. The multiple correlation with all ten objective scores
in the regression only reached .55. This tends to show rather clearly that the two
kinds of tests are not much alike and that scores on one might easily change
without a corresponding change in the other.



Marks, E., and Noll, G. A. Procedures and criteria for evaluating reading and
listening comprehension tests. Educ. and Psychol. Meas., 1967, 27, 335-348.